Chromosome-level genome assembly of the freshwater mussel Sinosolenaia oleivora (Heude, 1877)

Sinosolenaia oleivora (Bivalvia, Unionida, Unionidae) is a near-endangered edible mussel. In 2022, it was selected by the Ministry of Agriculture and Rural Affairs as a top-ten aquatic germplasm resource with potential for industrial development. Using Illumina, PacBio, and Hi-C technologies, a high-quality chromosome-level genome of S. oleivora was assembled. The assembled S. oleivora genome spanned 2052.29 Mb, with a contig N50 of 20.36 Mb and a scaffold N50 of 103.57 Mb. A total of 302 contigs, accounting for 98.41% of the assembled genome, were anchored to 19 chromosomes using Hi-C scaffolding. A total of 1171.78 Mb of repeat sequences were annotated and 22,971 protein-coding genes were predicted. Compared with the nearest ancestor, 603 expanded and 1767 contracted gene families were found. This study provides important genomic resources for conservation, evolutionary research, and genetic improvement of economic traits such as growth performance.


Background & Summary
Freshwater mussels (Unionoida) represent the most diverse order of freshwater bivalves 1 and are found in all regions of the world except the Antarctic 2. They not only play an important role in the food web structure and material cycle of ecosystems 3,4 but also have high economic value, such as for food 5, pearl cultivation 6, and anti-tumor ingredients 7. They also have been used as an indicator for biological monitoring and evaluation of heavy metal pollution 8.
Freshwater mussels are benthic filter feeders 9. Suitable substrate, water quality, and food are important factors for the survival and reproduction of mussels. In recent years, human activities such as river diversion, chemical pollution, and overfishing have caused serious damage to mussel habitats 10. The life history of most mussels involves a parasitic larval stage (glochidia) that must attach to vertebrate hosts (primarily fish) to complete metamorphosis 11, which increases their vulnerability 2. The International Union for Conservation of Nature (IUCN) Red List reports that 173 species are extinct, endangered, or threatened, 99 are vulnerable or near threatened, and 84 are unclassified because data are deficient 12.
There are 57 endemic species in China 13, and eight species have now been listed as Grade II national protected animals 14. The biodiversity and population size of freshwater mussels in large water bodies such as the Yangtze River 15 and the Songhua River 16 have declined significantly. S. oleivora is endemic to China. In 2022, S. oleivora was identified as one of the top ten characteristic aquatic germplasm resources by the Ministry of Agriculture and Rural Affairs. S. oleivora has fresh and tender meat, a delicious taste, and high nutrient content 17. In Fuyang of Anhui Province, Tianmen of Hubei Province, and other places, S. oleivora is a famous delicacy with high economic value, and it is called "abalone in Huaihe River." It once had an extensive distribution in five freshwater lakes and the tributaries of the Yangtze and Huaihe Rivers 18. Habitat fragmentation and other human activities (e.g., overfishing) have resulted in its endangerment 19. Tianmen in Hubei Province and Fuyang in Anhui Province have established S. oleivora nature reserves to support this ecologically and economically vital resource.

Methods
Sample collection and sequencing. One female S. oleivora was sampled from the national-level protection zone of the aquatic germplasm resource of S. oleivora in the Fuyang Division of Huaihe River (32.428725°N, 115.600287°E). Total DNA was extracted from the adductor muscle of S. oleivora using the DNeasy Blood and Tissue Kit (Qiagen, Germany) for genome sequencing. For short-read sequencing, a Covaris M220 was used to shear the DNA into 300-350 bp fragments. DNA library preparation was completed by terminal repair, A-tail addition, sequencing adapter addition, DNA purification, and bridge PCR. Based on a paired-end (PE) sequencing strategy, these libraries were sequenced on the Illumina NovaSeq 6000 platform. For long-read sequencing, a PacBio HiFi library was generated according to the PacBio standard protocol using an SMRTbell Template Prep Kit 2.0 (Pacific Biosciences, USA) and sequenced using the PacBio Sequel II platform. A Hi-C library was prepared following the Hi-C library protocol 29 and sequenced using the Illumina NovaSeq 6000 platform. Total RNA was extracted from the adductor muscle of S. oleivora using TRIzol reagent (Invitrogen, MA, USA) for transcriptome sequencing.
Estimation of genome size. A k-mer-based method 30 was applied to estimate the genome size, heterozygosity, and repeat content of S. oleivora. We performed a k-mer (k = 17) frequency distribution analysis using 192.1 Gb of Illumina clean data (Fig. 2). A total of 153,573,141,235 k-mers with a depth of 73 were obtained. The estimated genome size was 2,025 Mb, the heterozygosity ratio was 0.78%, and the repeat sequence ratio was 61.37%.
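The relationship between k-mer count, k-mer depth, and genome size can be sketched as follows. This is a minimal illustration of the common estimate (genome size ≈ total k-mers / main-peak depth), not the authors' actual pipeline, and the function name is hypothetical. Dividing the two reported values gives roughly 2,104 Mb, somewhat above the reported 2,025 Mb; the text does not state which correction (for example, exclusion of low-frequency error k-mers) was applied to reach the final figure.

```python
def estimate_genome_size(total_kmers: int, peak_depth: int) -> float:
    """Return the naive k-mer-based genome size estimate in bases.

    Assumes the simple relation: size = total k-mers / depth of the main peak.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Values reported in the text for k = 17.
total_kmers = 153_573_141_235
peak_depth = 73

size_mb = estimate_genome_size(total_kmers, peak_depth) / 1e6
print(f"Naive estimate: {size_mb:.1f} Mb")  # Naive estimate: 2103.7 Mb
```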

Hi-C-assisted chromosome-level assembly.
To assemble the chromosome-level genome, Hi-C sequencing data were mapped and sorted against the draft genome assembly with Juicer v1.6 software 33. The contigs were linked to 19 distinct chromosomes by 3D-DNA (v.180922) 34. Based on chromosome interactions, the contig orientation was corrected and suspicious fragments were removed from the contigs in the Juicebox software 35. The genome contigs were further anchored and oriented to chromosomes by Hi-C scaffolding. The Hi-C library generated 191.8 Gb of clean data, with 55.56% valid pairs. A total of 302 contigs, accounting for 98.41% of the total assembled genome, were anchored into 19 chromosomes. The 19 pseudo-chromosomes were clearly distinguished in the Hi-C heatmap, with strong pseudo-chromosome interactions confirming a high-quality Hi-C assembly (Figs. 3, 4). This resulted in a high-quality genome of 2052.30 Mb, with a contig N50 of 20.36 Mb and a scaffold N50 of 103.57 Mb (Table 3).
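For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of how it is conventionally computed from a list of contig or scaffold lengths. The function name and the toy lengths are hypothetical and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly: total length 300, so half is 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 70, because 80 + 70 = 150 reaches half the total
```

Applied to contig lengths this gives the contig N50, and applied to scaffold or pseudo-chromosome lengths it gives the scaffold N50.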
Table 10. Protein-coding genes under positive selection in Sinosolenaia oleivora (FDR < 0.05).
Fig. 7 GO enrichment analysis of positively selected genes.
The genome sequence was soft-masked based on repetitive element predictions and then used for protein-coding gene prediction. We employed three methods for gene prediction. For homology-based annotation, the protein sequences of Mizuhopecten yessoensis, Crassostrea gigas, Crassostrea virginica, and Mytilus galloprovincialis were downloaded from NCBI and aligned to the genome sequence using BLAST (E-value: 1e-5) 41. Homologous sequences were then aligned to the corresponding matching proteins using GeneWise (v.wise2-4-1) 42. For the RNA-seq-based annotation, transcriptomic data were assembled using Trinity v2.11 43, and BLAST (E-value: 1e-5) 41 was used to align the transcripts to the genome. For de novo prediction, Augustus (v3.4.0) 44 and Genscan (version 1.0) 45 were used to generate de novo-predicted gene sets. Maker (v2.31.10) 46 was used to integrate the results from these methods to produce the final gene set. The genome sequence was also aligned to the homologous single-copy gene database of Benchmarking Universal Single-Copy Orthologs (BUSCO) 47. MAKER (version 2.31.10) 48 and HiCESAP (Wuhan Gooalgene Co., Ltd., https://www.gooalgene.com/) were employed to merge all the data and filter out redundancies. The combination of de novo and homolog-based methods predicted 22,971 protein-coding genes (Table 6). The predicted genes were functionally annotated based on external protein databases including SwissProt, InterPro, TrEMBL, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Gene Ontology (GO). A total of 19,229 genes, accounting for 87.52% of all predicted genes, were annotated using public databases (Table 7).
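The E-value cutoff of 1e-5 used in the BLAST alignments can be illustrated with a small filter over tabular BLAST output. This is only a sketch under stated assumptions: it assumes the default 12-column tabular format (`-outfmt 6`), in which the E-value is the 11th column, and the hit records and the helper name `filter_hits` are invented for illustration. The text does not say whether the cutoff was applied within BLAST itself or in a later filtering step.

```python
import csv
import io

EVALUE_COL = 10  # 11th column of default BLAST tabular output (-outfmt 6)


def filter_hits(handle, max_evalue=1e-5):
    """Yield tabular BLAST rows whose E-value is at or below max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        if float(row[EVALUE_COL]) <= max_evalue:
            yield row


# Two made-up hits; identifiers and numbers are purely illustrative.
example = io.StringIO(
    "protA\tchr1\t85.0\t200\t30\t0\t1\t200\t1000\t1600\t1e-50\t180\n"
    "protB\tchr2\t40.0\t50\t30\t0\t1\t50\t500\t650\t0.002\t35\n"
)
kept = [row[0] for row in filter_hits(example)]
print(kept)  # ['protA']
```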

Data Records
All sequencing data from three sequencing platforms have been uploaded to the NCBI SRA database (transcriptomic sequencing data: SRR28352171 61, genomic Illumina sequencing data: SRR26551344 62, genomic PacBio sequencing data: SRR28406055 63, Hi-C sequencing data: SRR28406264 64). The final chromosome-level assembled genome file has been uploaded to the GenBank database under the accession JBDPLI000000000 65. Genome annotation files have been uploaded to the Figshare database 66.

Technical Validation
Evaluating the quality of the DNA and RNA. Before genome sequencing, the quality (OD260/280 and OD260/230) and concentration of the extracted DNA/RNA were assessed using a NanoDrop 2000 Spectrophotometer (Thermo Fisher Scientific, San Jose, CA, USA) and a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, San Jose, CA, USA), and their integrity was further evaluated on a 1% agarose gel stained with ethidium bromide.
Evaluating the quality of the genome assembly. We evaluated the genome assembly quality through the following measures: (i) The assembly was confirmed to belong to the target species by BLAST (E-value: 1e-5) 26 comparison against the NCBI nucleotide database (NT library) (Table S2, S3, Supplementary File); (ii) Illumina short reads and PacBio reads were mapped onto the assembled genome using BWA (v.0.7.17-r1188) 67 and Minimap2 68 to evaluate the completeness and accuracy of the genome. The read-mapping rates were 99.27% and 99.74%, and the genome coverage rates were 99.7% and 99.98% for the Illumina and PacBio reads, respectively (Table 11), indicating high mapping efficiency and comprehensive coverage; (iii) BUSCO (v5.2.3) 32 analysis was conducted to evaluate the assembly quality based on the mollusca_odb10 database. Of the 5295 BUSCO groups searched, 88.6% were found complete in the assembly, comprising 85.8% complete and single-copy BUSCOs and 2.8% complete and duplicated BUSCOs (Table 12).
Evaluating the quality of the genome annotation. BUSCO (v5.2.2) 32 was used to evaluate the completeness of the genome annotation. The reference BUSCO database was mollusca_odb10. Among the 5295 BUSCO groups searched, 4575 (86.4%) of the complete BUSCOs were detected in the genome annotations (Table 12).
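The BUSCO percentages quoted above follow from simple arithmetic, sketched here as a quick consistency check. The helper name `percent` is hypothetical; the counts and percentages are those given in the text.

```python
def percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


total_groups = 5295          # BUSCO groups searched (mollusca_odb10)
annotation_complete = 4575   # complete BUSCOs detected in the annotation

print(percent(annotation_complete, total_groups))  # 86.4

# For the assembly, complete = single-copy + duplicated (percentages from the text).
single_copy, duplicated = 85.8, 2.8
print(round(single_copy + duplicated, 1))  # 88.6
```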

Fig. 2
Frequency distribution of the sample's K-mer depth and K-mer species.

Fig. 3
Hi-C chromosome heatmap of Sinosolenaia oleivora. Blocks represent pseudochromosomes. The color bar represents contact density from white (low) to red (high). The same applies to Fig. 4.

Fig. 6
Numbers of expanded and contracted gene families in Sinosolenaia oleivora. The green number represents the number of gene families that have expanded during the evolutionary process of a species, whereas the red number represents the number of gene families that have contracted.

Table 1.
Statistics for the sequencing data of the Sinosolenaia oleivora genome.

Table 3.
Statistics of Hi-C assembly results of Sinosolenaia oleivora.

Table 4.
Statistics of repetitive sequences in the Sinosolenaia oleivora genome.

Table 5.
Statistics of transposable elements for the Sinosolenaia oleivora genome.

Table 6.
Statistics of gene predictions in the Sinosolenaia oleivora genome.
The total RNA extracted with TRIzol reagent (Invitrogen, MA, USA) was used for transcriptome sequencing. The RNA-seq library was generated using the NEBNext® Ultra™ RNA Library Prep Kit (NEB, USA) for PE sequencing, and short reads were produced on the Illumina NovaSeq 6000 platform. A total of 192.1 Gb of Illumina data, 63.2 Gb of PacBio data, 191.8 Gb of Hi-C data, and 5.6 Gb of RNA-Seq data were obtained (Fig. 1, Table 1).

Table 7.
Functional annotations of predicted genes.

Table 8.
Non-coding RNA annotation of the Sinosolenaia oleivora genome.

Table 11.
The alignment of Illumina and PacBio reads to Sinosolenaia oleivora.

Table 12.
BUSCO analysis results of the Sinosolenaia oleivora genome.